Statistical Models for Frequent Terms in Text

نویسندگان

  • Edoardo M. Airoldi
  • William W. Cohen
چکیده

In this paper we present statistical models for text which treat words with higher frequencies of occurrence in a sensible manner, and perform better than widely used models based on the multinomial distribution on a wide range of classification tasks, with two or more classes. Our models are based on the Poisson and Negative-Binomial distributions, which keep desirable properties of simplicity and analytic tractability.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Bayesian Methods for Frequent Terms in Text: Models of Contagion and the ∆ Statistic

Most statistical approaches to modeling text implicitly assume that informative words are rare. This assumption is usually appropriate for topical retrieval and classification tasks; however, in non-topical classification and soft-clustering problems where classes and latent variables relate to sentiment or author, informative words can be frequent. In this paper we present a comprehensive set ...

متن کامل

Bayesian Models for Frequent Terms in Text

In this paper we present statistical models for text which treat words with higher frequencies of occurrence in a sensible manner, and perform better than widely used models based on the multinomial distribution on a wide range of classification tasks, with two or more classes. Our models are based on the Poisson and Negative-Binomial distributions, which keep desirable properties of simplicity...

متن کامل

An Optimal Approach to Local and Global Text Coherence Evaluation Combining Entity-based, Graph-based and Entropy-based Approaches

Text coherence evaluation becomes a vital and lovely task in Natural Language Processing subfields, such as text summarization, question answering, text generation and machine translation. Existing methods like entity-based and graph-based models are engaging with nouns and noun phrases change role in sequential sentences within short part of a text. They even have limitations in global coheren...

متن کامل

A new model for persian multi-part words edition based on statistical machine translation

Multi-part words in English language are hyphenated and hyphen is used to separate different parts. Persian language consists of multi-part words as well. Based on Persian morphology, half-space character is needed to separate parts of multi-part words where in many cases people incorrectly use space character instead of half-space character. This common incorrectly use of space leads to some s...

متن کامل

یافتن الگوهای مکرّر در قرآن کریم به‌‌کمک روش‌‌های متن‌‌کاوی

Quran’s Text differs from any other texts in terms of its exceptional concepts, ideas and subjects. To recognize the valuable implicit patterns through a vast amount of data has lately captured the attention of so many researchers. Text Mining provides the grounds to extract information from texts and it can help us reach our objective in this regard. In recent years, Text Mining on Quran and e...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2004